


Practical Neural Networks (2)


Part 2: Back Propagation Neural Nets 


By Chris MacLeod and Grant Maxwell 





Back Propagation (BP) Networks are the quintessential Neural Nets.
Probably eighty percent of nets used today are of this type. Actually
though, Back Propagation is the learning or training method, rather than
the network structure itself.


The network operates in the same way as
the type we've looked at in Part 1: you apply
the inputs and calculate an output exactly as
described. What the Back Propagation part
does is allow you to change the weights, so
that the network learns and gives you the
output you want. The weights which the network
starts off with are simply set to small random
numbers (say between -1 and +1).
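
As a minimal sketch of this starting point, the weights could be set up as below. The function name `random_weights`, the list-of-lists layout and the optional `seed` are assumptions made for illustration; the article only specifies small random values between -1 and +1.

```python
import random

def random_weights(n_neurons, n_inputs, low=-1.0, high=1.0, seed=None):
    """Return one row of weights per neuron, each drawn uniformly from [low, high]."""
    rng = random.Random(seed)  # a seed makes the run repeatable (an assumption, not required)
    return [[rng.uniform(low, high) for _ in range(n_inputs)]
            for _ in range(n_neurons)]

# Example: three hidden neurons, each fed by four inputs.
hidden_weights = random_weights(3, 4, seed=1)
```

Each row holds the weights feeding one neuron, so `hidden_weights[1][2]` would be the weight from input 2 to hidden neuron 1 in this layout.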


What is BP good for? 


Back Propagation is excellent for simple
pattern recognition and mapping tasks. It learns
by example.

To give a typical application, we can train
a BP network for character recognition. All
you need to do is give it examples of the
characters, together with the output you would
like the network to produce for each one, and
it will learn from them (see Figure 1).

The algorithm works by calculating an
error, which is the amount by which the
output differs from an ideal value (chosen by
you, and called the Target), and then
changing the weights to minimise this error. Once
the network is trained, it should give
the correct output when a character is applied, even
if the character is distorted, imperfect or
noisy. In the example of Figure 1, because the Target has
two bits, we need two output neurons (one
for each bit). Each input and its associated
Target is called a Training Pair.

Figure 1. Use of a BP network for image recognition.


What does a BP network look like?


Figure 2 shows a BP network being 
used for Pattern Recognition. 

A common question is: how big
should the network be? We can see
from Figure 2 that the number of
inputs is fixed by the pattern we are
trying to process. In the case of four
pixels, there must be four inputs.
Likewise, the number of output
neurons is fixed by the number of
patterns we want to recognise. If we
had nine patterns we could either
use four output neurons and binary
code their outputs (three bits would
cover only eight patterns), or we could use
nine and assign them so that, for
example, when pattern 2 appears,
output neuron 2 gives a ‘1’ (and the
rest are zero).
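
The two ways of assigning outputs can be sketched as follows. The helper names and the choice to number patterns from zero are assumptions for illustration. Note that nine patterns need four binary-coded outputs, since three bits distinguish only eight.

```python
def bits_needed(n_patterns):
    """Smallest number of binary-coded output neurons that can tell n_patterns apart."""
    return max(1, (n_patterns - 1).bit_length())

def binary_coded(index, n_bits):
    """Target holding the pattern number in binary, most significant bit first."""
    return [(index >> bit) & 1 for bit in reversed(range(n_bits))]

def one_per_pattern(index, n_patterns):
    """Target with a 1 on the neuron assigned to this pattern and 0 on the rest."""
    return [1 if i == index else 0 for i in range(n_patterns)]

n_patterns = 9
print(bits_needed(n_patterns))                    # 4
print(binary_coded(2, bits_needed(n_patterns)))   # [0, 0, 1, 0]
print(one_per_pattern(2, n_patterns))             # [0, 0, 1, 0, 0, 0, 0, 0, 0]
```

Binary coding keeps the output layer small, while one neuron per pattern makes the result easier to read off directly.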

This only really leaves the number
of neurons in the hidden layer to
be decided on. Fortunately, networks
are quite flexible about this parameter
and will operate over a wide
range of hidden layer neurons,
although the more patterns the network
needs to remember, the more
neurons you will need. In a network
designed to recognise all 26 letters
of the alphabet (26 output neurons)
on a 5x7 grid (35 inputs), the network
will function with anywhere
between about 6 and 22 neurons. If
you have too few, then the network
hasn't got enough weights to store
all the information in; if there are too
many, it becomes inefficient and
prone to a problem called local minima
(discussed later).

Figure 2. A network wired for recognising patterns.


The BP algorithm 


Now let's have a look at the training
algorithm itself. To do this, we'll refer
to three neurons labelled A, B and C
in Figure 3.

The weight that we'll train is that
between neuron A and neuron B and
is labelled $W_{AB}$ in the diagram. The
diagram also shows another weight,
$W_{AC}$, and we'll return to that
one in a moment.

The algorithm works like this: 


1. First, apply the inputs to the net- 
work and calculate its outputs as 
described last month in Part 1 (this 
is the forward pass). 

2. Next, calculate the output error for
neuron B. The error is basically:
What you want - What you get.
What you want is your target and
what you get is your output.
Mathematically:

$$\text{Error}_B = \text{Output}_B \, (1 - \text{Output}_B) \, (\text{Target}_B - \text{Output}_B)$$

The term $\text{Output}_B \, (1 - \text{Output}_B)$ is
present because of the effect of the
sigmoid function; if we were just
using a binary threshold, we would
omit it.

Figure 3. Three neurons which are part of
a larger network.

3. Change the weight. Let $W^{+}_{AB}$ be
the new (trained) weight and $W_{AB}$
be the original (untrained) weight:

$$W^{+}_{AB} = W_{AB} + (\text{Error}_B \times \text{Output}_A)$$

Notice that we use the error of the
second neuron (B), but the output of
the feeding neuron (A).

4. Change all the other weights in 
the output layer in this manner. 

5. To change the weights of the hidden
layers you need to calculate an
error for the hidden neurons. We do
this by Back Propagating the errors of
the output neurons. For example,
suppose we want to calculate the
error for neuron A. We use the errors
calculated for all the output neurons
attached to it, in this case B and C,
and propagate them back, hence
the name of the algorithm.

$$\text{Error}_A = \text{Output}_A \, (1 - \text{Output}_A) \, (\text{Error}_B W_{AB} + \text{Error}_C W_{AC})$$

Again, the $\text{Output}_A \, (1 - \text{Output}_A)$ term
serves the purpose noted in 2.


6. Having obtained the errors for the 
hidden layer neurons, we now pro- 
ceed back to stage 3 and change 
their weights. 
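
The three formulas used in the steps above can be tried out on the neurons A, B and C with a short sketch. The function names and all the numbers are invented for illustration, and the hidden error is calculated here with the original (unchanged) weights, as in the formula of step 5.

```python
def output_error(output, target):
    """Error of an output neuron: Output * (1 - Output) * (Target - Output)."""
    return output * (1.0 - output) * (target - output)

def hidden_error(output, downstream):
    """Error of a hidden neuron; downstream is a list of (error, weight) pairs
    for every neuron that this neuron feeds."""
    return output * (1.0 - output) * sum(e * w for e, w in downstream)

def new_weight(weight, error_of_fed_neuron, output_of_feeding_neuron):
    """Trained weight = old weight + (error of fed neuron * output of feeding neuron)."""
    return weight + error_of_fed_neuron * output_of_feeding_neuron

# Illustrative values only (assumed, not taken from the article).
out_a, out_b, out_c = 0.5, 0.8, 0.4
target_b, target_c = 1.0, 0.0
w_ab, w_ac = 0.3, -0.2

error_b = output_error(out_b, target_b)                              # about 0.032
error_c = output_error(out_c, target_c)                              # about -0.096
error_a = hidden_error(out_a, [(error_b, w_ab), (error_c, w_ac)])    # about 0.0072
new_w_ab = new_weight(w_ab, error_b, out_a)                          # about 0.316
```

Neuron B's output is below its target, so the weight from A to B is nudged upwards.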


Now this might be a little confusing, 
so let's show a full example, Figure 
4. 


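1. Calculate errors of output neurons

$$\text{Error}_\alpha = \text{out}_\alpha \, (1 - \text{out}_\alpha) \, (\text{Target}_\alpha - \text{out}_\alpha)$$

$$\text{Error}_\beta = \text{out}_\beta \, (1 - \text{out}_\beta) \, (\text{Target}_\beta - \text{out}_\beta)$$


2. Change output layer weights

$$W^{+}_{A\alpha} = W_{A\alpha} + \eta \, \text{Error}_\alpha \, \text{out}_A \qquad W^{+}_{A\beta} = W_{A\beta} + \eta \, \text{Error}_\beta \, \text{out}_A$$

$$W^{+}_{B\alpha} = W_{B\alpha} + \eta \, \text{Error}_\alpha \, \text{out}_B \qquad W^{+}_{B\beta} = W_{B\beta} + \eta \, \text{Error}_\beta \, \text{out}_B$$

$$W^{+}_{C\alpha} = W_{C\alpha} + \eta \, \text{Error}_\alpha \, \text{out}_C \qquad W^{+}_{C\beta} = W_{C\beta} + \eta \, \text{Error}_\beta \, \text{out}_C$$


3. Calculate (back-propagate) hidden layer errors

$$\text{Error}_A = \text{out}_A \, (1 - \text{out}_A) \, (\text{Error}_\alpha W_{A\alpha} + \text{Error}_\beta W_{A\beta})$$

$$\text{Error}_B = \text{out}_B \, (1 - \text{out}_B) \, (\text{Error}_\alpha W_{B\alpha} + \text{Error}_\beta W_{B\beta})$$

$$\text{Error}_C = \text{out}_C \, (1 - \text{out}_C) \, (\text{Error}_\alpha W_{C\alpha} + \text{Error}_\beta W_{C\beta})$$


4. Change hidden layer weights

$$W^{+}_{\lambda A} = W_{\lambda A} + \eta \, \text{Error}_A \, \text{in}_\lambda \qquad W^{+}_{\Omega A} = W_{\Omega A} + \eta \, \text{Error}_A \, \text{in}_\Omega$$

$$W^{+}_{\lambda B} = W_{\lambda B} + \eta \, \text{Error}_B \, \text{in}_\lambda \qquad W^{+}_{\Omega B} = W_{\Omega B} + \eta \, \text{Error}_B \, \text{in}_\Omega$$

$$W^{+}_{\lambda C} = W_{\lambda C} + \eta \, \text{Error}_C \, \text{in}_\lambda \qquad W^{+}_{\Omega C} = W_{\Omega C} + \eta \, \text{Error}_C \, \text{in}_\Omega$$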

The constant $\eta$ (called the learning rate, and
nominally equal to one) is put in to speed up
or slow down the learning if required.
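
The four stages of the example can be collected into one function. This is only a sketch for a network with a single hidden layer (two inputs, three hidden neurons and two outputs in the example call). The name `reverse_pass`, the nested-list weight layout and the sample numbers are assumptions, and the hidden errors are taken from the weights as they were before stage 2.

```python
def reverse_pass(inputs, hidden_out, outputs, targets, w_hidden, w_output, eta=1.0):
    """One reverse pass. w_hidden[j][i] is the weight from input i to hidden neuron j;
    w_output[k][j] is the weight from hidden neuron j to output neuron k.
    Returns new weight lists and the errors; the lists passed in are left unchanged."""
    # 1. Calculate errors of output neurons.
    err_out = [o * (1 - o) * (t - o) for o, t in zip(outputs, targets)]
    # 2. Change output layer weights.
    new_w_output = [[w + eta * err_out[k] * hidden_out[j] for j, w in enumerate(row)]
                    for k, row in enumerate(w_output)]
    # 3. Back-propagate: hidden layer errors, using the original output weights.
    err_hid = [h * (1 - h) * sum(err_out[k] * w_output[k][j] for k in range(len(err_out)))
               for j, h in enumerate(hidden_out)]
    # 4. Change hidden layer weights.
    new_w_hidden = [[w + eta * err_hid[j] * inputs[i] for i, w in enumerate(row)]
                    for j, row in enumerate(w_hidden)]
    return new_w_hidden, new_w_output, err_out, err_hid

# Illustrative numbers only; in practice hidden_out and outputs come from the forward pass.
inputs = [1.0, 0.0]
hidden_out = [0.5, 0.5, 0.5]
outputs = [0.5, 0.5]
targets = [1.0, 0.0]
w_hidden = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
w_output = [[0.4, 0.0, -0.4], [0.0, 0.0, 0.4]]

new_w_hidden, new_w_output, err_out, err_hid = reverse_pass(
    inputs, hidden_out, outputs, targets, w_hidden, w_output)
```

Passing a value of `eta` below one makes each weight change smaller, which slows the learning down.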


Using BP to train a network 


Now that we've seen the algorithm in detail,
let's look at how to use it. One of the most
common mistakes made when programming
a BP network for the first time concerns the order in
which you apply the patterns to the network.
Let us take an example. Suppose you wanted
to teach the network to recognise the first
four letters of the alphabet, placed on a 5x7
grid.

The correct way to train the network is to 
apply the first letter, and then change all the 
weights of the network once (i.e., do all the 
calculations in Figure 4, once only). Then 
apply the second pattern and do the same 
again, then the third and finally the fourth. 
Once you've gone through this cycle once,
start all over again with pattern 1. Figure 5
shows the idea.

We stop the network when the total error
is low enough, that is, when the sum of all
the errors (the positive error from every
neuron, summed over every pattern) is below a
threshold. This threshold is usually set by the
user to be some arbitrary low number, like
0.1. In the example above the total error of the
network would be:


(Errors of all neurons in pattern 1) + (Pattern 2
errors) + (Pattern 3 errors) + (Pattern 4 errors)

Before doing this, it is necessary to make all
the errors positive; we can do this by
squaring them.
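
A short sketch of this total is given below. It assumes that the error of each output neuron is taken as Target minus Output before squaring, which is one reasonable reading of the text. The function name and the sample values are invented for illustration.

```python
def total_error(targets_per_pattern, outputs_per_pattern):
    """Sum of squared (target - output) over every output neuron and every pattern."""
    total = 0.0
    for targets, outputs in zip(targets_per_pattern, outputs_per_pattern):
        for t, o in zip(targets, outputs):
            total += (t - o) ** 2   # squaring makes every error positive
    return total

# Two patterns, two output neurons each (illustrative values).
targets = [[1.0, 0.0], [0.0, 1.0]]
outputs = [[0.9, 0.2], [0.1, 0.7]]

error = total_error(targets, outputs)   # 0.01 + 0.04 + 0.01 + 0.09, about 0.15
trained = error < 0.1                   # False: still above the 0.1 threshold
```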

The learning process is shown in the algo- 
rithm below: 


1. Apply first pattern, perform forward pass, 
perform reverse pass. 

2. Apply second pattern, perform forward 
pass, perform reverse pass. 

3. Apply third pattern, perform forward pass, 
perform reverse pass. 

4. Apply fourth pattern, perform forward 
pass, perform reverse pass. 

5. Test: is total error small enough? If yes,
then go to 7.

6. Go to 1.

7. Stop, network has trained.


A common mistake to make is running the pro- 
gram on pattern one until the error is low, then 
on pattern two and then on pattern three. If 
you do this, then the network will only learn 
the last pattern you've presented it with. 
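
The correct ordering can be sketched as a loop that presents every pattern once per cycle. Here `train_once` is a hypothetical caller-supplied function standing for one forward pass plus one reverse pass, and the stand-in below simply pretends that each update halves that letter's error so that the order of presentation can be seen.

```python
def train(patterns, train_once, threshold=0.1, max_cycles=1000):
    """Present every pattern once per cycle until the total error is below threshold.
    train_once(pattern) must update the weights once and return that pattern's error.
    Returns (cycles_used, total_error_of_last_cycle)."""
    total = float("inf")
    for cycle in range(1, max_cycles + 1):
        total = 0.0
        for pattern in patterns:          # one update per pattern, then move on
            total += train_once(pattern)
        if total < threshold:             # test only after the whole cycle
            return cycle, total
    return max_cycles, total

# Stand-in for a real network (assumption for illustration only).
presented = []
errors = {"A": 1.0, "B": 1.0, "C": 1.0, "D": 1.0}

def fake_train_once(letter):
    presented.append(letter)
    errors[letter] *= 0.5
    return errors[letter]

cycles, total = train(["A", "B", "C", "D"], fake_train_once)
print(presented[:8])    # ['A', 'B', 'C', 'D', 'A', 'B', 'C', 'D']
print(cycles, total)    # 6 0.0625
```

The `max_cycles` limit is an added safeguard so that the loop ends even if the error never falls below the threshold.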

Once the network has learned, you can 
apply any of the inputs to it (just apply the 
input and run a forward pass with the trained 
weights) and it should recognise them. We 
can then use the network to recognise pat- 
terns in a real system. 

A more accurate way to train the network
is to use a validation set. This is similar to the
set of patterns which you are training the
network with, but with noise or other
imperfections added. After the training set
has been applied, the validation set is run
through the network to check its performance
(we don't use the validation set to change the
network weights). When the net has fully
trained, both the validation set and the
training set will give a low error. If you're training
the network too much, then the validation set
error will start to increase, as shown in Figure 6.
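
One way of acting on this is to record the validation error after every cycle and keep the weights from the cycle where it was lowest. The sketch below assumes a list of such errors is available. The `patience` parameter (how many cycles without improvement to tolerate) is an assumption and does not come from the article.

```python
def best_stopping_cycle(validation_errors, patience=3):
    """Return the index of the cycle with the lowest validation error, giving up
    once the error has failed to improve for `patience` cycles in a row."""
    best_index, best_error, waited = 0, float("inf"), 0
    for i, err in enumerate(validation_errors):
        if err < best_error:
            best_index, best_error, waited = i, err, 0
        else:
            waited += 1
            if waited >= patience:
                break                     # validation error is rising: over-training
    return best_index

# Illustrative validation errors, one per training cycle.
history = [0.9, 0.6, 0.4, 0.3, 0.35, 0.4, 0.5, 0.2]
stop_at = best_stopping_cycle(history)    # 3
```

The error falls until cycle 3 and then rises for three cycles, so the scan stops there and never reaches the later value of 0.2.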


Algorithms in software 


In part 1 we discussed various ways of
coding the network. One way was to store the
weights in a three-dimensional array, with
indexes denoting the layer number, the neuron
number and the connection number. A
suitable algorithm for a Back Propagation reverse
pass in such a network might be:


1. Initialise all unused weights, targets, errors 
and outputs to zero 

2. Calculate output errors, see Listing 1, first 
part. 

3. Change weights, see Listing 1, second 
part. 

4. Calculate error of hidden layers, see
Listing 1, third part.

Here, in addition to the variables
explained in part 1 of this course,
E(L,n) and T(L,n) are the errors and
targets respectively of layer L,
neuron n.

Figure 4. All the calculations for a complete reverse pass in a network.


Putting it all together 


Now that we have algorithms for 
both the forward and reverse pass 
of the network, we can put them 
together into a coherent whole. 
Given below is a suggestion, showing
how this can be done:


1. Set up inputs and targets for the
network (either in a file, or in arrays).

2. Randomise the weights being used.

3. Apply first pattern, calculate
network output (forward pass) and
error, use error to change weights
(reverse pass), once only. Keep a
note of the error.

4. Do the same for second pattern.
Add error to the running total from
pattern one.


Listing 1

REM First part: calculate the output errors
FOR x = first_output_neuron TO final_output_neuron_number
    E(output_layer, x) = O(output_layer, x) * (1 - O(output_layer, x)) * (T(output_layer, x) - O(output_layer, x))
NEXT x

FOR L = number_of_layers TO 1 STEP -1

    REM Second part: change the weights between layer L and layer L + 1
    FOR n = 1 TO max_number_of_neurons
        FOR c = 1 TO max_number_of_weights
            W(L, n, c) = W(L, n, c) + E(L + 1, n) * O(L, c)
        NEXT c
    NEXT n

    REM Third part: calculate the errors of layer L
    REM (E(L, n) must start at zero, see step 1 of the algorithm)
    FOR n = 1 TO max_number_of_neurons
        FOR c = 1 TO max_number_of_weights
            E(L, n) = E(L, n) + E(L + 1, c) * W(L, c, n)
        NEXT c
        E(L, n) = E(L, n) * O(L, n) * (1 - O(L, n))
    NEXT n

NEXT L


Elektor Electronics 2/2003 





Figure 5. How a network learns four patterns. Apply the first letter,
calculate the error and change all the weights in the network once;
apply the next letter and change all the weights again; apply the third
letter and change the weights; finally apply the fourth letter, change
the weights and start again at A.


5. Repeat for all subsequent
patterns, keep running total of error.

6. If error is too great (network still
not fully trained) then zero running
total and go to 3, else go to 7.

7. Network is trained and ready to
be used, either use directly or
store trained weights in a file for
future use.


Problems and additions


Although BP is a very useful and 
simple algorithm, it does have some 
problems and limitations. Let’s start 
with its limitations. 

BP is excellent for the sort of simple
pattern recognition and mapping
tasks explained above and in the first
article. However, it only works well
when the image it is to recognise is the correct
size and placed in a central position on the
grid. It's no good at, say, recognising a face in a
crowd, unless you can centre the face or
make the network ‘scan’ the picture until it falls
onto the face (and even then you still have to
make the face the correct size). In other words,
many problems need to be ‘pre-processed’
before being presented to the network.

So these networks need to operate in a 
controlled environment, which means that 
applications such as Optical Character 
Recognition (OCR) are more suitable. They 
have problems dealing with the crowded and 
confusing real world. 

Incidentally, the human brain solves this 
problem by first identifying ‘features’ in an 
image, for example, horizontal or vertical 
lines and then integrating these progressively 
into a whole image in a layered structure. So 
if you can identify a horizontal line along the 
top of an image and a vertical line down the 
middle, you can integrate these to find the 
letter T. This approach is more tolerant
because these features (the two lines) are
always present in T, no matter where it's
placed in the image or what size it is.

When running your network, you may run
into problems with its training. The most
common is known as ‘local minima’. This
occurs because the algorithm always follows
the error downwards (it can't make a change
of weights which causes the error to
increase). But sometimes, as part of a
downwards trend, the error must go up, as shown
in Figure 7. In this case the training gets
stuck and the weights can't move out of the
local minimum.

This problem doesn't really affect small
networks, but becomes more serious as the network
size increases. One solution is to add
‘momentum’ to the network. This involves allowing
the change of weight to continue for some
time in a particular direction as shown below:


New_weight = Old_weight + weight_change + Weight_change_from_previous_iteration
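
The momentum update can be sketched as below. The scaling factor `alpha` is an assumption added here so that the carried-over change can be reduced; with `alpha` equal to one the function matches the formula as written. The sketch also assumes that the change carried forward is the total change applied in the previous iteration.

```python
def momentum_step(weight, weight_change, previous_change, alpha=1.0):
    """Return (new_weight, applied_change).
    weight_change is this iteration's ordinary BP change (error * output);
    previous_change is the change applied to this weight in the last iteration."""
    applied = weight_change + alpha * previous_change
    return weight + applied, applied

# Illustrative numbers: the weight keeps moving in the direction it was already going.
new_weight, applied = momentum_step(0.5, 0.1, 0.2, alpha=0.5)   # about 0.7 and 0.2
```

The second returned value is kept and passed in as `previous_change` on the next iteration.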


However, a simpler way to overcome this
problem (and several others which affect
training) is simply to monitor the training
progress of the network and, if the error gets
‘stuck’ (does not decrease for some time),
reset the weights of the network to
different random values and start training
again.
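
A sketch of this monitoring is given below. Both `make_weights` and `run_cycle` are hypothetical caller-supplied functions, and `stall_limit` (the number of cycles without improvement that counts as stuck) is an assumed parameter. The stand-ins at the end imitate a first attempt that gets stuck and a second that trains.

```python
def train_with_restarts(make_weights, run_cycle, threshold=0.1,
                        stall_limit=50, max_restarts=10):
    """make_weights() returns fresh random weights; run_cycle(weights) trains on
    every pattern once and returns the total error. Returns (weights, restarts_used),
    or (None, max_restarts) if no attempt reached the threshold."""
    for restart in range(max_restarts + 1):
        weights = make_weights()
        best, stalled = float("inf"), 0
        while stalled < stall_limit:
            error = run_cycle(weights)
            if error < threshold:
                return weights, restart
            if error < best:
                best, stalled = error, 0   # still improving
            else:
                stalled += 1               # no improvement this cycle
    return None, max_restarts

# Stand-ins for a real network (assumptions for illustration only).
attempts = []

def make_weights():
    attempts.append(len(attempts))
    return {"attempt": attempts[-1], "error": 1.0}

def run_cycle(weights):
    if weights["attempt"] == 0:
        return 0.5                         # first attempt is stuck at 0.5
    weights["error"] *= 0.5                # later attempts halve the error each cycle
    return weights["error"]

weights, restarts = train_with_restarts(make_weights, run_cycle, stall_limit=5)
print(restarts)   # 1
```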

In next month's instalment of this course,
we'll have a look at networks which have
recurrent connections, including the famous
‘Hopfield’ network.